install.packages() and library()
# install.packages("tidyverse")
# install.packages("readxl")
# install.packages("writexl")
# install.packages("ggplot2")
# install.packages("gapminder")
# install.packages("scales")
library(tidyverse)
library(ggplot2)
library(readxl)
library(writexl)
library(gapminder)
library(scales)
install.packages() and library() go hand-in-hand the first time you use a package. Once a package is installed, you don’t need to reinstall it every time you open a new R session; just run library(package_name) to load it.
setwd() and getwd()
setwd() allows us to set a working directory.
getwd() allows us to see what working directory we are in right now.
An R script is simply a text file containing (almost) the same commands that you would enter on the command line of R. (top left box)
The console window is the place where R is waiting for you to tell it what to do, and where it will show the results of a command. (bottom left box)
Lastly, the environment is the collection of all the objects, variables, and functions you have created. (top right box)
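As a small illustration of the working-directory functions described above, the sketch below reads the current directory, switches to a temporary one, and switches back. Base R spells these functions getwd() and setwd(). The temporary folder is only a stand-in for a real project folder, and old_wd is a name made up for this example.

```r
# Remember where we are so we can come back afterwards
old_wd <- getwd()

# tempdir() stands in for a real project folder (an assumption for this sketch)
setwd(tempdir())
getwd()          # now shows the temporary folder

# Return to the original working directory
setwd(old_wd)
getwd()
```

Saving the old directory first means the script leaves your session where it found it.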
gapminder_tbl <- read_xlsx("gapminder.xlsx")
gapminder_tbl
3+4
## [1] 7
5-3
## [1] 2
3*5
## [1] 15
3/4
## [1] 0.75
3^4
## [1] 81
a <- 3+4
a
## [1] 7
a <- 95/7
a
## [1] 13.57143
Some common data types that we will learn today are:
# numeric
class(7)
## [1] "numeric"
class(7.2)
## [1] "numeric"
# character
class("abcd")
## [1] "character"
# factor
class(as.factor("High"))
## [1] "factor"
# logical
class(TRUE)
## [1] "logical"
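The same checks work on whole vectors, and values can be converted from one type to another. The following is a minimal base R sketch; the vector names (x, y, z, lvl) and their values are made up for illustration.

```r
# A numeric vector
x <- c(1.5, 2, 3)
class(x)

# Convert numbers to text
y <- as.character(x)
class(y)

# Convert text that looks like a number back to numeric
z <- as.numeric("7.2")
z + 1

# A factor stores text as a fixed set of categories (levels)
lvl <- as.factor(c("High", "Low", "High"))
levels(lvl)

# Comparisons return logical values
is.logical(3 > 2)
```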
%>% is the pipe operator. The shortcut for the pipe operator is Shift + CMD/CTRL + M.
view()
view() allows us to take a look at the whole dataset.
gapminder_tbl %>%
  view()
glimpse()
gapminder %>%
  glimpse()
## Rows: 1,704
## Columns: 6
## $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
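The calls above all rely on the pipe, which passes the object on its left as the first argument of the function on its right. A minimal sketch of that equivalence, assuming the dplyr package is installed; toy_tbl and its values are made up for illustration and are not gapminder data.

```r
library(dplyr)   # provides %>% and tibble()

# A tiny made-up table, used only for illustration
toy_tbl <- tibble(country = c("A", "B", "C"), pop = c(10, 20, 30))

# These two calls do the same thing
piped  <- toy_tbl %>% head(2)
nested <- head(toy_tbl, 2)
identical(piped, nested)

# Pipes chain left to right, so the steps read in the order they run
toy_tbl %>%
  head(2) %>%
  nrow()
```

Written with pipes, a long sequence of steps stays readable from top to bottom instead of being nested inside out.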
glimpse() allows us to take a quick glance at the structure of our dataset. It shows the type of each variable in the dataset.
head()
gapminder_tbl %>%
  head()
head() returns the first six observations from our dataset.
tail()
gapminder_tbl %>%
  tail()
tail() returns the last six observations from our dataset.
select() and why we use select()?
Imagine we are working on a hypothetical dataset with 150 columns. Out of those 150 columns, we need at most 5. This is when select() comes in handy.
In our dataset, let’s say we only want country, continent, year, and population as our columns.
gapminder_tbl %>%
  select(country, continent, year, pop)
everything() to select the rest of the columns
everything() allows us to select the rest of the columns instead of manually typing them out.
gapminder_tbl %>%
  select(continent, everything())
filter()
filter() allows us to keep only the rows (observations) that meet a condition.
== is an equality operator. It allows you to check whether two objects are equal or not.
Inside filter(), double equals (==) means equal to and != means not equal to.
gapminder_france_tbl <- gapminder_tbl %>%
  filter(country == "France")
gapminder_france_tbl
# gapminder_tbl %>%
# filter(continent == "Asia")
#
# gapminder %>%
# filter(year == 1952)
#
# gapminder %>%
# filter(continent != "Europe")
count()
count() allows us to quickly count unique values of one or more variables.
sort = TRUE sorts the result so that the largest counts come first.
gapminder_tbl %>%
  count(continent, sort = TRUE)
mutate()
mutate() allows us to create new columns or modify the existing columns.
gapminder_tbl %>%
  mutate(pop_increased_10_times = pop * 10)
# gapminder %>%
# mutate(pop_increased_by_10 = pop + 10)
gapminder_tbl %>%
  mutate(continent = as.factor(continent)) %>%
  glimpse()
## Rows: 1,704
## Columns: 6
## $ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop <dbl> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
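mutate() can also build a new column from several existing ones. The sketch below assumes dplyr is installed and uses a small made-up table (toy_tbl, toy_gdp_tbl, and all values are assumptions, not gapminder data) to compute a total GDP column from population and GDP per capita.

```r
library(dplyr)

# Made-up values, for illustration only
toy_tbl <- tibble(
  country   = c("A", "B"),
  pop       = c(1000, 2000),
  gdpPercap = c(5, 10)
)

toy_gdp_tbl <- toy_tbl %>%
  mutate(
    gdp          = pop * gdpPercap,  # new column built from two existing ones
    pop_millions = pop / 1e6         # new column; the original pop is kept
  )

toy_gdp_tbl
```

Because the new columns get new names, the existing columns are left untouched; reusing an existing name, as with continent above, overwrites that column instead.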
gapminder_tbl %>%
  mutate(continent = as.factor(continent)) %>%
  mutate(continent = as.character(continent)) %>%
  glimpse()
## Rows: 1,704
## Columns: 6
## $ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asi…
## $ year <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop <dbl> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
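One practical difference between the two types converted above is that a factor carries a fixed set of levels, and the order of those levels is what plots and tables use for ordering. A minimal base R sketch with a made-up vector (sizes_chr and the other names are hypothetical):

```r
sizes_chr <- c("Medium", "Low", "High", "Low")

# as.factor() sorts the levels alphabetically
sizes_fct <- as.factor(sizes_chr)
levels(sizes_fct)

# factor() lets us state the order of the levels ourselves
sizes_ordered <- factor(sizes_chr, levels = c("Low", "Medium", "High"))
levels(sizes_ordered)

# Converting back to character returns the original text values
as.character(sizes_ordered)
```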
arrange()
arrange() allows us to sort rows by a column. The order is ascending by default; wrap the column in desc(variable_name) for descending order.
gapminder_afg_asc <- gapminder_tbl %>%
  filter(country == "Afghanistan") %>%
  arrange(pop)
gapminder_afg_asc
gapminder_afg_desc <- gapminder %>%
  filter(country == "Afghanistan") %>%
  arrange(desc(pop))
gapminder_afg_desc
ntile()
ntile() takes in your entire column, decides what cut-points to use, and bins the values into however many roughly equal-sized bins you want.
gapminder_tbl %>%
  mutate(gdpPercap_bin = ntile(gdpPercap, 3))
ntile() and case_when()
gapminder_tbl %>%
  mutate(
    gdpPercap_bin2 = case_when(
      gdpPercap > quantile(gdpPercap, 0.66) ~ "High",
      gdpPercap > quantile(gdpPercap, 0.33) ~ "Medium",
      TRUE                                  ~ "Low"
    )
  )
group_by() and summarise()
group_by() and summarise() usually go hand-in-hand.
group_by() takes an existing table and converts it into a grouped table, so that later operations are performed group by group. These operations are carried out with summarise().
After using group_by() and summarise(), make sure to ungroup().
gapminder_tbl %>%
  filter(year == 1952) %>%
  group_by(continent) %>%
  summarise(population = sum(pop)) %>%
  ungroup() %>%
  arrange(desc(population))
ggplot() comes from the ggplot2 library.
ggplot() allows us to create plots using the data.
gapminder_tbl %>%
  filter(year == 1952) %>%
  group_by(continent) %>%
  summarise(total_population = sum(pop)) %>%
  ungroup() %>%
  # arrange(desc(total_population)) %>%
  # as.factor() orders the levels alphabetically, so the bars appear A to Z
  mutate(continent = as.factor(continent)) %>%
  # Visualize
  ggplot(aes(continent, total_population)) +
  geom_col(fill = "#2c3e50", width = 0.5) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  labs(title = "Population of Different Continents in 1952",
       x = "",
       y = "Population",
       subtitle = "",
       caption = "Data Source: Gapminder")
gapminder_tbl %>%
  filter(year == 1952) %>%
  group_by(continent) %>%
  summarise(total_population = sum(pop)) %>%
  ungroup() %>%
  arrange(desc(total_population)) %>%
  # as_factor() keeps the levels in order of appearance, so the bars follow the sort above
  mutate(continent = as_factor(continent)) %>%
  # Visualize
  ggplot(aes(continent, total_population)) +
  geom_col(fill = "#2c3e50", width = 0.5) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  labs(title = "Population of Different Continents in 1952",
       x = "",
       y = "Population",
       subtitle = "",
       caption = "Data Source: Gapminder")
gapminder_france_tbl as an Excel file
write_xlsx() allows us to save our table as an Excel file.
write_xlsx(name_of_the_table_in_R, path = "wherever_you_want_to_save/give_a_name.xlsx")
# writexl::write_xlsx(gapminder_france_tbl, path = "gapminder_france.xlsx")
# install.packages("corrplot")
library(corrplot)
gapminder_tbl %>%
  select(year:gdpPercap) %>%
  cor() %>%
  corrplot(method = "number")
gapminder_tbl %>%
  select(year:gdpPercap) %>%
  cor() %>%
  corrplot(method = "color", order = "alphabet")
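The matrix that cor() hands to corrplot() can also be inspected directly, which helps when you want a single coefficient rather than a picture. A minimal sketch, assuming dplyr is installed; toy_tbl and its values are made up for illustration and are not gapminder data.

```r
library(dplyr)

# Made-up numeric columns, for illustration only
toy_tbl <- tibble(
  year    = c(1952, 1957, 1962, 1967),
  lifeExp = c(30, 32, 35, 36),
  pop     = c(8, 9, 10, 12)
)

# cor() returns a square matrix with one row and one column per variable
cor_mat <- toy_tbl %>%
  select(year:pop) %>%
  cor()

# Round for easier reading
round(cor_mat, 2)

# Pull out one pairwise correlation by row and column name
cor_mat["year", "lifeExp"]
```

Note that cor() only accepts numeric columns, which is why the select() step comes first.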